Improving TTS by higher agreement between predicted versus observed pronunciations

Authors

  • Yeon-Jun Kim
  • Ann K. Syrdal
  • Matthias Jilka
Abstract

This paper looks at improving unit selection text-to-speech (TTS) quality by optimizing the agreement between the front end and the speech database. We focused, in particular, on two classes of problems that degrade synthesis quality: 1) the realization of /d/ and /t/ sounds and 2) confusions of unstressed vowels, especially with schwas. We investigated two approaches to tackling these problems. First, we improved the phonological processing in the front-end modules. Further improvement resulted from creating speaker-dependent pronunciation lexicons for automatic speech labeling of our voice databases. This change helped alleviate many pronunciation errors that resulted from mismatches between lexical pronunciations and how the speaker (voice talent) actually pronounced a word, while keeping the labeling consistent. Each speaker has his or her own unique pronunciations (and context-dependent variations), so no single standard lexicon can cover all of the speakers' variations. A subjective listening test showed that combining these two approaches resulted in a perceived quality improvement for American English male and female voices.
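The abstract does not say how the speaker-dependent lexicons were built, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each recorded word already has an observed phone sequence (for example from a labeling pass) and simply adopts the speaker's most frequent variant per word, falling back to the standard lexicon for unseen words. The function names, the toy phone symbols, and the example data are all hypothetical, and context-dependent variation is ignored.

```python
from collections import Counter, defaultdict


def build_speaker_lexicon(observed, canonical):
    """Return a lexicon that prefers the speaker's most frequent variant.

    observed:  iterable of (word, phone_sequence) pairs taken from the
               labeled recordings (hypothetical input format).
    canonical: dict mapping word -> phone tuple from a standard lexicon.
    """
    counts = defaultdict(Counter)
    for word, phones in observed:
        counts[word][tuple(phones)] += 1

    lexicon = dict(canonical)  # unseen words keep the canonical entry
    for word, variants in counts.items():
        lexicon[word] = variants.most_common(1)[0][0]
    return lexicon


def agreement(lexicon, observed):
    """Fraction of observed tokens whose phones match the lexicon entry."""
    observed = list(observed)
    if not observed:
        return 0.0
    hits = sum(1 for word, phones in observed if lexicon.get(word) == tuple(phones))
    return hits / len(observed)


# Toy data (assumed): the speaker usually flaps the /t/ in "water"
# and reduces the vowel in "the" to a schwa.
canonical = {
    "water": ("w", "ao", "t", "er"),
    "the": ("dh", "ah"),
    "cat": ("k", "ae", "t"),
}
observed = (
    [("water", ("w", "ao", "dx", "er"))] * 3
    + [("water", ("w", "ao", "t", "er"))]
    + [("the", ("dh", "ax"))] * 2
)

speaker_lexicon = build_speaker_lexicon(observed, canonical)
baseline = agreement(canonical, observed)
adapted = agreement(speaker_lexicon, observed)
```

On this toy data the adapted lexicon agrees with more of the observed tokens than the canonical one does, which is the kind of predicted-versus-observed agreement the abstract is concerned with. A real system would also need to handle context-dependent variants and keep the labeling consistent, which this sketch does not attempt.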

Related resources

Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS

Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. Hence, the TTS quality drops when phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is desired. To solve this problem, the present work...

A comparison of pronunciation modeling approaches for HMM-TTS

Hidden Markov model-based text-to-speech (HMM-TTS) systems are often trained on manual voice corpus phonetic transcriptions, despite the fact that because these manual pronunciations cannot be predicted with complete accuracy at synthesis time, the result is training/synthesis mismatch. In this paper, an alternate approach is proposed in which a set of manually written post-lexical effects (PLE...

Reducing the corpus-based TTS signal degradation due to speaker's word pronunciations

The goal of producing a corpus-based synthesizer with the owner’s voice can only be achieved if the system can handle recordings with less than ideal characteristics. One of the limitations is that a normal speaker does not always pronounce a word exactly as predicted by the language rules. In this work we compare two methods for handling variations on word pronunciation for corpus-based speech...

Pronunciation lexicon adaptation for TTS voice building

This paper describes reducing phone label errors in TTS voice building by means of modeling of speaker pronunciation variants. Each speaker has his or her own unique pronunciations (and context-dependent variations), so that no one standard lexicon is able to cover all of the speaker’s variations. Creating speaker-dependent pronunciation lexicons for automatic speech labeling of our TTS voice d...

Improving the accuracy of pronunciation p

This paper describes a technique which improves the accuracy of pronunciation prediction for unit selection TTS. It does this by performing an orthography-based context-dependent lookup on the unit database. During synthesis, the pronunciations of words which have matching contexts in the unit database are determined. Pronunciations not found using this method are determined using traditional l...

Publication date: 2004